[SPARK-32346][SQL] Support filters pushdown in Avro datasource #29145
MaxGekk wants to merge 23 commits into apache:master
Test build #126053 has finished for PR 29145 at commit
Test build #126063 has finished for PR 29145 at commit
Test build #126072 has finished for PR 29145 at commit
Test build #126074 has finished for PR 29145 at commit
Test build #126114 has finished for PR 29145 at commit
Test build #126115 has finished for PR 29145 at commit
Test build #126128 has finished for PR 29145 at commit
@gengliangwang @dongjoon-hyun @HyukjinKwon @cloud-fan Please take a look at this PR.
Thank you for pinging me, @MaxGekk.
@dongjoon-hyun I am looking forward to review comments from you.
Test build #126199 has finished for PR 29145 at commit
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
Outdated
Test build #126255 has finished for PR 29145 at commit
retest this please
Test build #126303 has finished for PR 29145 at commit
jenkins, retest this, please
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala
Outdated
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala
Outdated
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala
Test build #126310 has finished for PR 29145 at commit
Test build #126339 has finished for PR 29145 at commit
…checking out

### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`.
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`.
2. Moving the config check out of `JsonFilters` makes it consistent with `OrderedFilters` (see #29145).
3. `JsonFilters` can be used by other datasources in the future and should not depend on the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@cloud-fan Please review this PR.
@gengliangwang Are you ok with this PR?
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
gengliangwang left a comment
LGTM except for one comment
Thanks, merging to master
What changes were proposed in this pull request?
In the PR, I propose to support pushed down filters in Avro datasource V1 and V2.
1. Added a new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to the Avro datasource. It is on by default.
2. Renamed `CSVFilters` to `OrderedFilters`.
3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2).
4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`.

Why are the changes needed?
The changes improve performance on synthetic benchmarks by up to 2 times on JDK 11:
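A minimal sketch of where such a speed-up can come from, assuming that pushed-down filters are evaluated as soon as the fields they reference have been converted, so that a rejected row skips the remaining conversions. This is a toy model in plain Scala, not the PR's code and not its benchmark: `PushdownSketch`, `Predicate`, `convert`, and `deserialize` are hypothetical names, and only the `None`-on-rejection behaviour mirrors the `deserialize` change listed above.

```scala
import scala.util.Try

// Toy model only: these names are hypothetical and are not the Spark
// classes touched by this PR.
object PushdownSketch {
  // A predicate that can be checked once fields 0..lastFieldIndex are converted.
  final case class Predicate(lastFieldIndex: Int, test: Array[Any] => Boolean)

  // Counts field conversions so the saved work is observable.
  var conversions = 0

  // Stand-in for the per-field conversion cost: parse an Int if possible.
  def convert(raw: String): Any = {
    conversions += 1
    Try(raw.toInt).getOrElse(raw)
  }

  // Converts fields left to right and returns None as soon as a predicate fails.
  def deserialize(raw: Array[String], predicates: Seq[Predicate]): Option[Array[Any]] = {
    val byIndex = predicates.groupBy(_.lastFieldIndex)
    val row = new Array[Any](raw.length)
    var i = 0
    while (i < raw.length) {
      row(i) = convert(raw(i))
      if (byIndex.getOrElse(i, Nil).exists(p => !p.test(row))) return None
      i += 1
    }
    Some(row)
  }

  def main(args: Array[String]): Unit = {
    val firstAbove10 = Predicate(0, row => row(0).asInstanceOf[Int] > 10)

    // Rejected after converting only the first of three fields.
    println(deserialize(Array("5", "a", "b"), Seq(firstAbove10)).isDefined) // false
    println(conversions) // 1

    // Accepted, so all three fields are converted.
    println(deserialize(Array("42", "a", "b"), Seq(firstAbove10)).isDefined) // true
    println(conversions) // 4
  }
}
```

In this model the saving grows with the number of fields that follow the filtered one and with the share of rows the filter rejects, which is why a synthetic benchmark can show a large effect.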
Does this PR introduce any user-facing change?
No
How was this patch tested?
1. `AvroCatalystDataConversionSuite` and `AvroSuite`
2. `AvroReadBenchmark` using Amazon EC2: `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` and `./dev/run-benchmarks`: